Main Research Questions:
- Which variables and classroom factors are the best predictors of teacher job satisfaction?
- How do teacher satisfaction and the significant predictors identified in question 1 vary by country?
This analysis will focus on the Teaching and Learning International Survey (TALIS) Dataset. This data set was coordinated by the Organization for Economic Cooperation and Development (OECD), an intergovernmental organization of industrialized countries. The TALIS dataset comes from a survey of the teaching workforce that aims to collect information about teaching as a profession, the working conditions of teachers, and the learning environments of schools. Another goal of the TALIS survey was to provide a more global view of education systems, and to produce metrics that could be used to compare education systems from different countries.
TALIS was administered in 2008 and 2013. We will be using the more recent international dataset, from 2013. It contains responses from teachers at more than 10,000 schools in 36 different countries; the only eligibility requirement was that a country be industrialized, so the participating countries were not all English speaking. The TALIS dataset is split into several files by school type (primary, secondary, etc.). The focus of the survey was on lower-secondary schools, which form the largest data file, so we are going to focus our analysis on lower-secondary schools. At each school that was selected to participate, the principal (or head administrator) and a random selection of up to 22 teachers were asked to complete voluntary online questionnaires. Our analysis below uses only the teacher responses, not those of the principals, who were given slightly different questionnaires.
The first question we hope to explore in our analysis, is: (1) Which variables and classroom factors are the best predictors of teacher job satisfaction?
There are several reasons motivating this question. Education is critical to any country's development and functioning: a country must educate its youth so that they can participate in society as adults. However, countries differ considerably in how much the teaching profession is respected and how well education systems are funded, so teacher working conditions vary quite a bit. To maintain a large enough teaching force that can teach students well, teacher satisfaction must stay at a high enough level. If satisfaction falls too low, it becomes difficult to retain teachers and to attract talented people to the field. Low teacher satisfaction can also indicate that teachers are having trouble doing their jobs well, possibly due to environmental factors outside their control. For reasons such as these, teacher satisfaction is a useful metric to observe in any education system. It is also helpful to understand what kinds of environments keep teachers more satisfied, so that education systems can encourage and build those environments. Therefore, understanding how teacher satisfaction varies across the 36 countries in the dataset, and which aspects of a school and its environment correlate most with teacher satisfaction, could be valuable information.
Our second research question is: (2) How do teacher satisfaction and the significant predictors identified in question 1 vary by country?
Each row of the dataset corresponds to a single teacher's response to the questionnaire (there are 117,876 rows). Each column corresponds to either a particular question in the questionnaire, or a summary variable for a section of the questionnaire.
For example, there are individual columns for questions such as "The advantages of this profession clearly outweigh the disadvantages" and "I would recommend my school as a good place to work", as well as an overall column summing the scores of the teacher satisfaction questions. There are columns concerning employment status, personal information/background, time spent, professional development, efficacy and amount of feedback, outlook about the school, school climate, overall job satisfaction, and more. The specific variables we are using are listed below. Almost all of the questions in the questionnaire were multiple choice or numerical input questions.
The variables we are using are:
- country (CNTRY): Country of teacher (factor variable, 36 levels)
- region: Region the teacher's country is in (factor variable, 8 levels)
- gender (TT2G01): Teacher gender (factor variable, 2 levels)
- age (TT2G02): Teacher age (numeric variable)
- years_teacher (TT2G05B): Years of teaching experience (numeric variable)
- train_stat (TT2G11): Training level of teacher (factor variable)
- prep_A/B/C (TT2G13A/B/C): How prepared the teacher felt (ordered factor variables, 4 levels)
- avg_prep: The average of each teacher's responses for prep_A/B/C (numeric variable between 1 and 4)
- total_time (TT2G16): Total time spent on teaching-related activities in hours per week (numeric variable)
- feedback_salary (TT2G30G): Whether performance feedback influences salary (ordered factor variable, 4 levels)
- collab (TT2G44E): How collaborative the school culture is (ordered factor variable, 4 levels)
- num_students (TT2G38): Number of students in the teacher's class (numeric variable)
- satisfaction (TJOBSATS): Self-rated teacher job satisfaction (ordered factor variable, 4 levels)
- Various subject names (TT2G15[A - L]): 12 columns representing which subjects the teacher teaches (boolean numeric variables)
- num_subjects: The number of subjects the teacher teaches (numeric variable)
- sat1-sat10 (TT2G46[A - J]): Ten questions related to teacher job satisfaction (ordered factor variables, 4 levels)
- sat_avg: An average of the ten questions related to teacher job satisfaction (numeric variable)

Below are three plots that show our data and motivate our research questions and the analysis.
Plot 1 provides a full histogram of teacher satisfaction. As we are later analyzing the effect of different variables on teacher satisfaction, and the variation between different regions, it is important to note the overall distribution of the data.
Plot 2 provides a box plot of collab vs sat_avg. This plot is an example of one we would find in regard to our first research question, in which we assess which variables are particularly predictive of overall teacher satisfaction. In looking at this particular plot, we can actually see a strong visual trend between how collaborative a teacher viewed his/her school environment to be, and how satisfied that teacher was. We will provide similar plots for other variables in the analysis of our first research question, as well as more rigorous analysis to be described below.
Plot 3 provides a box plot of geographic region vs sat_avg. This is relevant to our second research question, which revolves around variations in sat_avg and other variables across different geographical regions. Visually, while all median values for sat_avg hover around the range of 2.5 to 3.5, there is still a noticeable difference in some of the distributions (e.g. the median and both quartiles of the satisfaction value for North America is noticeably above that of Eastern Europe). We will explore additional plots like this one, but with other variables, in our full analysis.
Before we analyze the data any further, we should clean up the data slightly more. In previous assignments, we cleaned the dataset with respect to missing values and obvious errors. However, when considering the context of some of our variables, some of the entries are still unusual. We have decided to set these values to NA, since they are likely unreliable.
The total time spent on teaching related activities per week:
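The histogram is produced with ggplot2's `geom_histogram`; the sketch below uses a small stand-in sample of weekly hours (the report plots teacher_data$total_time over all respondents), with an explicit binwidth rather than the default 30 bins:

```r
library(ggplot2)

# Stand-in sample of weekly hours (illustration only); the report
# plots teacher_data$total_time over all respondents
hours <- data.frame(total_time = c(38, 40, 40, 42, 55, 60, 61, 90, 95))

ggplot(hours, aes(x = total_time)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Hours per week on teaching-related activities", y = "Count")
```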
We can see from this histogram that most teachers seem to have a 40-hour work week, which is a typical time commitment. There is also another clear dropoff after 60 hours, a reasonable extension of the 40-hour week for teachers who spend evenings on class planning and grading. However, some teachers claim to have spent over 90 hours per week on teaching-related activities. Even if a teacher worked from 6 am to 9 pm Monday through Friday (15 hours a day) between time at school and grading in the evening, plus another 15 hours of related activities such as class planning and grading on the weekend, they would be at 90 hours in a week. Although this is physically possible, it seems highly unlikely. Edge cases such as teachers at boarding schools could perhaps claim this kind of time commitment, but beyond unusual situations like that, more than 90 hours a week seems highly improbable. Given this, let's see how many teachers reported spending 90 or more hours a week on teaching-related activities.
length(teacher_data[teacher_data$total_time>=90 & !is.na(teacher_data$total_time),]$total_time)
## [1] 795
length(teacher_data[teacher_data$total_time>90 & !is.na(teacher_data$total_time),]$total_time)
## [1] 210
We can see that there are quite a few values of exactly 90, since the vector where total_time is greater than or equal to 90 is so much longer than the one where it is strictly greater than 90. Given this, we will keep the values of exactly 90 but set all values higher than 90 hours a week to NA.
teacher_data[teacher_data$total_time > 90 & !is.na(teacher_data$total_time), "total_time"] <- NA
We can now take a look at reported age vs years teaching.
We can see that there are points above the red line, which are teachers who reported they had been teaching longer than they had been alive. Indeed, let’s see how many responses reported years that would indicate that they had started teaching before age 15 (points above the yellow line), and remove those values.
age_start <- teacher_data$age - teacher_data$years_teacher
nrow(teacher_data[age_start < 15 & !is.na(age_start),])
## [1] 232
teacher_data[age_start < 15 & !is.na(age_start), "years_teacher"] <- NA
To address our first research question, and to try to identify which variables are the best indicators of teacher job satisfaction, we are going to attempt to use several techniques.
Although our earlier plot of satisfaction against the degree of collaboration at the school shows a clear linear trend, let’s look at a matrix of plots of the data, before diving into any linear regression models:
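The matrix of plots can be produced with `ggpairs` from GGally (loaded in our setup chunk); a minimal sketch on a stand-in frame — in the report this is run on columns of teacher_data:

```r
library(GGally)

# Stand-in frame (illustration only); the report uses columns of teacher_data
set.seed(1)
d <- data.frame(sat_avg       = runif(50, 1, 4),
                total_time    = rnorm(50, 45, 10),
                years_teacher = rpois(50, 12))

ggpairs(d)  # pairwise scatterplots, densities, and correlations
```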
Dataset for Linear Regression Model: To start, for our linear regression model, we are going to remove columns with a large number of NAs, the subject data columns (we're only interested in the number of subjects each teacher is teaching, not the specific subjects), and the 10 satisfaction-related questions, as they were used to create our response variable.
One of the things we are interested in is how geography impacts teacher satisfaction; however, we have two different geography variables. We can't include both in the model, because region is completely determined by country (the two are collinear), so R will only estimate coefficients for one of the two variables if we put both in the model. Instead, let's fit a model with each and compare the results:
Linear model using country (summary has been left out for space):
# Try model with country variable and not region
m_country <- lm(sat_avg ~ ., data=x[,-which(names(x) %in% c("region"))])
Linear model using region (summary has been left out for space):
# Try model with region variable and not country
m_region <- lm(sat_avg ~ ., data=x[,-which(names(x) %in% c("country"))])
We can see from the two summaries that many of the country coefficients are significant, and all of the region coefficients are significant. However, the R-squared value for the model using country was 0.2449, while the R-squared value for the model using region was 0.1981. Condensing the countries into regions therefore discarded geographical information and dropped R-squared by roughly 19%, which is substantial. As a result, we are going to use the country variable in the linear model.
We can also see that train_stat was not significant in either of the models, so let’s also remove that from future models.
Lastly, it seems very possible that age and years_teacher are correlated, since teachers that have been teaching for longer are likely to be older. We can also see that they seem quite correlated from the plot earlier of age against years_teacher. As a result, we probably don’t need both predictors in our linear model. Let’s check the correlation:
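The call pattern is sketched below on a small stand-in frame; the value reported next comes from running this on the full teacher_data:

```r
# Stand-in columns (illustration only); the reported value comes from
# running this on the full teacher_data
teacher_data <- data.frame(age           = c(30, 45, 52, NA, 38),
                           years_teacher = c(5, 20, 28, 10, NA))

# use = "complete.obs" drops pairs where either value is missing
cor(teacher_data$age, teacher_data$years_teacher, use = "complete.obs")
```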
## [1] 0.8462086
This shows that age and years_teacher are highly correlated (correlation of 0.846). As a result, let's see whether we can combine our age and years_teacher data into one metric. Two options are the age at which the teacher started teaching, and the ratio of years_teacher to age. Perhaps the point in life when someone starts teaching affects how satisfied they are; of course, interpreting this variable directly as the age teachers start teaching makes the implicit assumption that they never took a break from teaching. It's also possible that the fraction of one's life spent teaching affects satisfaction.
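Both candidate variables can be derived directly from the existing columns; a minimal sketch on stand-in rows (in the report these columns come from the full teacher_data, and `age_start`/`age_ratio` are the names used below):

```r
# Stand-in rows (illustration only); in the report these columns
# come from the full teacher_data
teacher_data <- data.frame(age = c(30, 45, 52), years_teacher = c(5, 20, 28))

# Age at which the teacher started teaching (assumes no career breaks)
teacher_data$age_start <- teacher_data$age - teacher_data$years_teacher

# Fraction of the teacher's life spent teaching
teacher_data$age_ratio <- teacher_data$years_teacher / teacher_data$age
```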
Now let’s see whether our new predictors, or one of our older predictors, seems to have the clearest linear relationship with our response variable, sat_avg:
From the plots above, it looks like years_teacher has the clearest linear relationship with sat_avg, so we’re going to use that predictor in our linear model.
Let’s also make sure that the prep_A, prep_B, and prep_C predictors are not too highly correlated.
## prep_A prep_B prep_C
## prep_A 1.0000000 0.6463396 0.6254094
## prep_B 0.6463396 1.0000000 0.7381657
## prep_C 0.6254094 0.7381657 1.0000000
We can see from the correlation matrix above that all three prep questions are fairly correlated. As a result, let's create a single predictor, the average of all three, to capture each teacher's overall feeling of how prepared they were.
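One way to build the combined predictor is a row-wise mean of the three prep columns; a sketch, assuming the ordered factors have already been converted to their 1-4 numeric scores:

```r
# Stand-in prep columns already on the 1-4 numeric scale (illustration only)
prep <- data.frame(prep_A = c(3, 4, 2),
                   prep_B = c(3, 3, 2),
                   prep_C = c(4, 3, 1))

# Row-wise mean; na.rm = TRUE tolerates a missing answer to one question
prep$avg_prep <- rowMeans(prep[, c("prep_A", "prep_B", "prep_C")], na.rm = TRUE)
```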
Now let's make a new dataset reflecting the variable selections we just made: we remove region, train_stat, age, age_start, age_ratio, and the three individual prep predictors.
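Dropping the columns can be done by name; a sketch on a stand-in frame (in the report, `x` holds all candidate predictors plus sat_avg, as in the models above):

```r
# Stand-in frame with a few of the columns (illustration only);
# in the report x holds all candidate predictors plus sat_avg
x <- data.frame(region = 1, train_stat = 1, age = 40, prep_A = 3,
                years_teacher = 10, sat_avg = 3.1)

# Columns excluded after the variable-selection steps above
drop_cols <- c("region", "train_stat", "age", "age_start", "age_ratio",
               "prep_A", "prep_B", "prep_C")
x.new <- x[, !(names(x) %in% drop_cols)]

names(x.new)  # only the retained predictors and sat_avg remain
```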
Now let’s see if all of the variables that we’ve decided to use for the linear model, are normally distributed, or if some transformations might be helpful:
We can see from the histogram of sat_avg above, that although it isn’t perfectly normal, it is close enough that it probably doesn’t warrant any transformation of the variable. Additionally, after we have a final model, we can check the qqnorm plot to make sure none of the normality assumptions are drastically violated.
Both of these histograms, although not perfectly normal, also don't have a huge amount of skew. In fact, the histogram for time spent on teaching-related activities has very little skew. As a result, we probably don't need to transform either of these variables either.
Model without train_stat (summary removed from report for the sake of space):
m_final <- lm(sat_avg ~ ., data=x.new)
Before analyzing this model, let’s make sure that the assumptions for linear regression hold.
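The standard lm diagnostics can be produced as below; a sketch on a stand-in model built from R's built-in mtcars data (in the report these checks are run on m_final):

```r
# Stand-in model (illustration only); in the report these checks
# are run on m_final
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, leverage
```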
Now let's analyze the coefficients of the model. We can see from the summary of m_final that many of the coefficients have statistically significant p-values. However, this is somewhat expected given the large size of our dataset. To decide which predictors are particularly helpful in predicting teacher satisfaction, let's look at the magnitudes of the coefficients and at their confidence intervals, and see how far they are from zero.
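The interval table is produced with `confint`; a sketch on a stand-in model (the table below comes from calling this on m_final over the full dataset):

```r
# Stand-in model (illustration only); the report calls confint(m_final)
fit <- lm(mpg ~ wt + hp, data = mtcars)

confint(fit, level = 0.95)  # 95% confidence interval for every coefficient
```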
## 2.5 % 97.5 %
## (Intercept) 2.4409493899 2.4930517354
## countryAUS 0.1671017243 0.2191982708
## countryBFL 0.2180608930 0.2647446693
## countryBGR -0.0538969688 -0.0048729942
## countryBRA -0.0491803769 -0.0099267590
## countryCAB 0.1510620572 0.2046046494
## countryCHL 0.0842709971 0.1423484344
## countryCSH -0.1893090175 -0.1450183893
## countryCZE -0.1135712088 -0.0674291217
## countryDNK 0.1492583814 0.2042377021
## countryENG 0.0610273057 0.1105288540
## countryESP 0.1088670541 0.1547525172
## countryEST -0.1705123357 -0.1235392035
## countryFIN 0.2306808115 0.2792257572
## countryFRA 0.0217088368 0.0698893204
## countryGEO -0.0429452324 0.0130923462
## countryHRV -0.0279757893 0.0190993456
## countryISR 0.1179064500 0.1640745421
## countryITA 0.0271980845 0.0733943515
## countryJPN -0.0989857835 -0.0520627956
## countryKOR -0.0562273641 -0.0083916154
## countryLVA -0.1397995238 -0.0883318262
## countryMEX 0.4494836350 0.4990107641
## countryMYS 0.2013556169 0.2485775811
## countryNLD 0.1778618104 0.2306266470
## countryNOR 0.0405604252 0.0884971717
## countryNZL 0.1640289325 0.2128210410
## countryPOL -0.0517789334 -0.0069240896
## countryPRT -0.0189540002 0.0265089813
## countryROU -0.0751897414 -0.0294330413
## countryRUS 0.0090171858 0.0543119368
## countrySGP -0.0069788569 0.0398705791
## countrySRB 0.0055768722 0.0516989676
## countrySVK -0.2090032614 -0.1634454339
## countrySWE -0.1159457544 -0.0696628901
## countryUSA 0.1452273307 0.1973166893
## genderMale -0.0270671971 -0.0158848861
## years_teacher 0.0004463662 0.0009495282
## total_time -0.0006314828 -0.0003096251
## collab.L 0.5511808984 0.5707500823
## collab.Q 0.0160337232 0.0318429244
## collab.C 0.0039770994 0.0150788571
## num_subjects 0.0017310041 0.0054113949
## avg_prep 0.1103231565 0.1197998034
Looking at the confidence intervals above, the collab predictor stands out: its linear component has by far the largest coefficient (roughly 0.55 to 0.57), suggesting that a collaborative school culture is the strongest predictor of teacher satisfaction in our model. avg_prep is also clearly positive (roughly 0.11 to 0.12), and the country coefficients span a wide range (from about -0.21 for SVK to about 0.50 for MEX), confirming that geography matters. The remaining predictors (gender, years_teacher, total_time, num_subjects) are statistically significant but have much smaller magnitudes.
To address our second research question, and try to identify the differences across different regions for teacher job satisfaction and other classroom factors, we are going to try several techniques.
First, we plot some maps to better visualize teacher satisfaction across countries and regions.
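The maps are drawn with rworldmap's join-and-plot pattern; a sketch on stand-in per-country averages (in the report, sat_avg is aggregated over teacher_data by country code):

```r
library(rworldmap)

# Stand-in per-country averages (illustration only); the report
# aggregates sat_avg over teacher_data by country code
country_sat <- data.frame(iso3     = c("USA", "FRA", "JPN"),
                          mean_sat = c(3.2, 3.0, 2.9))

# Match our rows to map polygons by ISO3 country code
map_df <- joinCountryData2Map(country_sat, joinCode = "ISO3",
                              nameJoinColumn = "iso3")

# Colour each matched country by its mean satisfaction
mapCountryData(map_df, nameColumnToPlot = "mean_sat",
               mapTitle = "Mean teacher satisfaction")
```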
## 36 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 207 codes from the map weren't represented in your data
Now, we plot the regions we categorized countries by, which we found using the World Health Organization data.
## 36 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 207 codes from the map weren't represented in your data
## using catMethod='categorical' for non numeric data in mapCountryData
As noted above, while all regional medians for sat_avg hover between 2.5 and 3.5, there are noticeable differences in some of the distributions (e.g. the median and both quartiles of the satisfaction value for North America sit noticeably above those for Eastern Europe). However, we would classify each region's average as "satisfied," since it falls above 2.5.
For collaboration, the median response is 3 (Agree) for all regions, but the quartile distributions differ. East Mediterranean teachers had a quartile of responses in Strongly Agree; North America, South America, and Western Europe had a quartile in Disagree; and Eastern Europe, Northern Europe, Southern Europe, and Western Pacific had 50% of responses in Agree.
For prep_A, the question that asked teachers whether they feel prepared for the "content of the subjects that I teach", East Mediterranean and Western Europe are the slight outliers, with 75% of teachers feeling very well prepared in the former, and more than half the teachers feeling less than very well prepared in the latter.
For prep_B, the question that asked teachers whether they feel prepared for the "pedagogy of the subjects I teach", East Mediterranean and Eastern Europe are the two regions with "Very Well" rather than "Well" as the median.
For prep_C, the question that asked teachers whether they feel prepared for the "classroom practice in the subjects that I teach", the regions split equally, with East Mediterranean, Eastern Europe, South America, and Southern Europe respondents' median at "Very Well", and the other four regions' median at "Well".
The median ages across regions range from the high 30s to the mid 40s, but the middle 50% of the data overlaps across all regions. The same is true for total time working, whose medians range from the mid 30s to the mid 40s (hours per week), and for years working as a teacher, whose medians range from the mid to late teens.
## Analysis of Variance Table
##
## Response: as.numeric(collab)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 916 130.871 274.64 < 2.2e-16 ***
## Residuals 112823 53763 0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(prep_A)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 1917 273.824 709.62 < 2.2e-16 ***
## Residuals 115316 44498 0.386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(prep_B)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 2443 348.97 776.49 < 2.2e-16 ***
## Residuals 114181 51315 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(prep_C)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 2743 391.87 826.62 < 2.2e-16 ***
## Residuals 114722 54385 0.47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(age)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 454885 64984 591.13 < 2.2e-16 ***
## Residuals 117729 12941978 110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(total_time)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 1319802 188543 697.46 < 2.2e-16 ***
## Residuals 114376 30919267 270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: as.numeric(years_teacher)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 265675 37954 352.71 < 2.2e-16 ***
## Residuals 109539 11787034 108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like all of the variables vary significantly across regions (each ANOVA has p < 2.2e-16).
Linear model summaries removed from below for space.
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(collab), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 2.305058
## regionEastern Europe
## -0.222979
## regionNorth America
## 0.336349
## regionNorthern Europe
## -0.128803
## regionSouth America
## -0.021319
## regionSouthern Europe
## -0.044542
## regionWestern Europe
## 0.031676
## regionWestern Pacific
## -0.210205
## as.numeric(collab)
## 0.251426
## regionEastern Europe:as.numeric(collab)
## 0.031849
## regionNorth America:as.numeric(collab)
## -0.047453
## regionNorthern Europe:as.numeric(collab)
## 0.024216
## regionSouth America:as.numeric(collab)
## -0.026088
## regionSouthern Europe:as.numeric(collab)
## -0.003595
## regionWestern Europe:as.numeric(collab)
## 0.001884
## regionWestern Pacific:as.numeric(collab)
## 0.044957
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(prep_A), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 2.720271
## regionEastern Europe
## -0.103469
## regionNorth America
## 0.527881
## regionNorthern Europe
## 0.224803
## regionSouth America
## -0.127831
## regionSouthern Europe
## -0.028685
## regionWestern Europe
## 0.024295
## regionWestern Pacific
## -0.343304
## as.numeric(prep_A)
## 0.095677
## regionEastern Europe:as.numeric(prep_A)
## -0.009471
## regionNorth America:as.numeric(prep_A)
## -0.107714
## regionNorthern Europe:as.numeric(prep_A)
## -0.085275
## regionSouth America:as.numeric(prep_A)
## -0.008408
## regionSouthern Europe:as.numeric(prep_A)
## -0.018552
## regionWestern Europe:as.numeric(prep_A)
## -0.006167
## regionWestern Pacific:as.numeric(prep_A)
## 0.072536
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(prep_B), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 2.639253
## regionEastern Europe
## -0.091273
## regionNorth America
## 0.539326
## regionNorthern Europe
## 0.197644
## regionSouth America
## -0.190891
## regionSouthern Europe
## -0.060986
## regionWestern Europe
## -0.055016
## regionWestern Pacific
## -0.282778
## as.numeric(prep_B)
## 0.121042
## regionEastern Europe:as.numeric(prep_B)
## -0.011002
## regionNorth America:as.numeric(prep_B)
## -0.112195
## regionNorthern Europe:as.numeric(prep_B)
## -0.076779
## regionSouth America:as.numeric(prep_B)
## 0.016084
## regionSouthern Europe:as.numeric(prep_B)
## -0.004576
## regionWestern Europe:as.numeric(prep_B)
## 0.033675
## regionWestern Pacific:as.numeric(prep_B)
## 0.062260
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(prep_C), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 2.678712
## regionEastern Europe
## -0.100252
## regionNorth America
## 0.480853
## regionNorthern Europe
## 0.126972
## regionSouth America
## -0.245045
## regionSouthern Europe
## -0.074309
## regionWestern Europe
## -0.069686
## regionWestern Pacific
## -0.328525
## as.numeric(prep_C)
## 0.109496
## regionEastern Europe:as.numeric(prep_C)
## -0.008917
## regionNorth America:as.numeric(prep_C)
## -0.094951
## regionNorthern Europe:as.numeric(prep_C)
## -0.056835
## regionSouth America:as.numeric(prep_C)
## 0.025142
## regionSouthern Europe:as.numeric(prep_C)
## -0.001740
## regionWestern Europe:as.numeric(prep_C)
## 0.037648
## regionWestern Pacific:as.numeric(prep_C)
## 0.074378
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(age), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 2.8828651
## regionEastern Europe
## -0.0300498
## regionNorth America
## 0.1507126
## regionNorthern Europe
## 0.1250463
## regionSouth America
## -0.1441969
## regionSouthern Europe
## 0.0590989
## regionWestern Europe
## 0.2539550
## regionWestern Pacific
## -0.0265292
## as.numeric(age)
## 0.0045099
## regionEastern Europe:as.numeric(age)
## -0.0026959
## regionNorth America:as.numeric(age)
## -0.0003197
## regionNorthern Europe:as.numeric(age)
## -0.0051077
## regionSouth America:as.numeric(age)
## -0.0002703
## regionSouthern Europe:as.numeric(age)
## -0.0037842
## regionWestern Europe:as.numeric(age)
## -0.0066228
## regionWestern Pacific:as.numeric(age)
## -0.0021435
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(total_time), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 3.037744
## regionEastern Europe
## -0.112422
## regionNorth America
## 0.231097
## regionNorthern Europe
## 0.045919
## regionSouth America
## -0.133780
## regionSouthern Europe
## -0.054682
## regionWestern Europe
## 0.081277
## regionWestern Pacific
## -0.051541
## as.numeric(total_time)
## 0.001192
## regionEastern Europe:as.numeric(total_time)
## -0.001007
## regionNorth America:as.numeric(total_time)
## -0.002726
## regionNorthern Europe:as.numeric(total_time)
## -0.003834
## regionSouth America:as.numeric(total_time)
## -0.001092
## regionSouthern Europe:as.numeric(total_time)
## -0.001418
## regionWestern Europe:as.numeric(total_time)
## -0.003094
## regionWestern Pacific:as.numeric(total_time)
## -0.001933
##
## Call:
## lm(formula = sat_avg ~ region * as.numeric(years_teacher), data = teacher_data)
##
## Coefficients:
## (Intercept)
## 3.0115921
## regionEastern Europe
## -0.1197846
## regionNorth America
## 0.1161870
## regionNorthern Europe
## -0.0135535
## regionSouth America
## -0.1669378
## regionSouthern Europe
## -0.0479019
## regionWestern Europe
## 0.0682944
## regionWestern Pacific
## -0.0666693
## as.numeric(years_teacher)
## 0.0041929
## regionEastern Europe:as.numeric(years_teacher)
## -0.0020702
## regionNorth America:as.numeric(years_teacher)
## 0.0002722
## regionNorthern Europe:as.numeric(years_teacher)
## -0.0051955
## regionSouth America:as.numeric(years_teacher)
## -0.0002995
## regionSouthern Europe:as.numeric(years_teacher)
## -0.0036030
## regionWestern Europe:as.numeric(years_teacher)
## -0.0061760
## regionWestern Pacific:as.numeric(years_teacher)
## -0.0037246
Next, we run an ANOVA test to see whether there are statistically significant differences in mean sat_avg across regions.
# Regression of sat_avg vs region
msatreg <- lm(sat_avg ~ region, data=teacher_data)
# ANOVA of regression
anova(msatreg)
## Analysis of Variance Table
##
## Response: sat_avg
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 601.8 85.965 395.98 < 2.2e-16 ***
## Residuals 113992 24747.4 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will also see whether there are statistically significant differences across regions in any of the predictors from part (1) that were particularly predictive of teacher satisfaction.
## Analysis of Variance Table
##
## Response: sat_avg
## Df Sum Sq Mean Sq F value Pr(>F)
## region 7 601.8 85.965 395.98 < 2.2e-16 ***
## Residuals 113992 24747.4 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s see how much of the R-squared in our model from part 1 was due to just the country predictor:
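The proportion is obtained by fitting a model with country as the only predictor and reading off its R-squared; a sketch on stand-in data (the value reported below comes from fitting this on the full teacher_data):

```r
# Stand-in data (illustration only); the reported value comes from
# fitting this on the full teacher_data
set.seed(1)
d <- data.frame(country = rep(c("A", "B"), each = 20),
                sat_avg = c(rnorm(20, 3.0, 0.3), rnorm(20, 3.3, 0.3)))

m_country_only <- lm(sat_avg ~ country, data = d)
summary(m_country_only)$r.squared  # share of variance explained by country alone
```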
## [1] 0.08213555
0.0821/0.2449 is roughly 33.5%; therefore, the country variable alone accounts for about a third of the variability that our final model in Part 1 was able to explain.
Overall, our models did not completely explain the variation in the data. We expected a real dataset to be messier than the ones we had dealt with in problem sets, so we were pleased with the observations we did find. The clearest finding from Part 1 is that how collaborative a teacher perceives the school culture to be is the strongest predictor of satisfaction in our model, with preparedness (avg_prep) and country also contributing meaningfully. From Part 2, satisfaction and the key predictors all vary significantly by region, although every region's average satisfaction still falls on the "satisfied" side of the scale.
Before we started looking into the data, we expected there to be a lot more unsatisfied teachers than there actually were. Indeed, if we take sat_avg scores above 2.5 to mean that the respondent is generally satisfied (on average, they responded Agree or Strongly Agree more than Disagree or Strongly Disagree on the teacher satisfaction questions), then
nrow(teacher_data[teacher_data$sat_avg >=2.5,])/nrow(teacher_data)
## [1] 0.8844549
of our respondents were satisfied. Of course, this is a very low baseline (it just means the teacher didn't, on average, respond "Agree" or "Strongly Agree" to a question like "I regret that I decided to become a teacher"). If we were to collect data specifically to understand how various factors affect teacher satisfaction, we would want a more precise sense of how satisfied or unsatisfied teachers are with specific aspects of teaching. We might ask respondents to rate, on a scale from 1 to 10, their satisfaction with things like working conditions, pay, sense of impact, and more.
As for concerns about the data: during cleaning we found some suspicious responses. Some were demonstrably impossible, such as having taught longer than one has been alive, but there may be other incorrectly reported responses that we could not immediately flag as suspicious. Because all of the data comes from a self-reported survey, inaccurate or falsely reported answers are pretty much inevitable, and they remain a concern.
One area where our analysis would have benefited from additional data sources is country-level information. We created our region variable using the World Health Organization region categories, so that it was easier to visualize and interpret results by groups of countries. However, there are many other ways we could have categorized the countries. For instance, because more collaborative school environments correlated with higher teacher satisfaction, we could have grouped countries by their rankings on the individualism vs. collectivism scale, to see whether national attitudes made a statistically significant difference.
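As an illustration of what that could look like, the sketch below builds a lookup table of individualism scores and shows how it would be joined onto the teacher data. The `indiv_score` values are illustrative placeholders, not sourced rankings, and the merge is left commented out because it assumes `teacher_data` is loaded as above:

```r
# Hypothetical sketch: grouping countries by an individualism score.
# The scores below are placeholder values for illustration only.
indiv_lookup <- data.frame(
  country     = c("USA", "AUS", "JPN", "KOR", "BRA"),
  indiv_score = c(91, 90, 46, 18, 38)
)
# On the real data, we could then test whether the score adds
# explanatory power over the collaboration variable alone:
# teacher_data <- merge(teacher_data, indiv_lookup, by = "country", all.x = TRUE)
# summary(lm(sat_avg ~ as.numeric(collab) + indiv_score, data = teacher_data))
indiv_lookup[order(-indiv_lookup$indiv_score), ]
```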
First, here is the data cleaning code from the Data Exploration Assignment:
knitr::opts_chunk$set(echo = FALSE, warning=FALSE)
library(foreign)
library(tidyr)
library(countrycode)
library(rworldmap)
library(ggplot2)
library(reshape2)
library(FNN)
library(grid)
library(gridExtra)
library(leaps)
library(GGally)
library(ggmosaic)
library(plyr)
library(dplyr)
# To start, let's read in the cleaned datasets from our last submission:
teacher_data <- read.csv("TALIS.csv")
# Change column names of 10 satisfaction questions to make more readable
satcols <- grep("TT2G46", colnames(teacher_data))
indices <- satcols[satcols != which(colnames(teacher_data) == "TT2G46_avg")]
colnames(teacher_data)[indices] <- c("sat1", "sat2", "sat3", "sat4", "sat5", "sat6", "sat7", "sat8", "sat9", "sat10")
names(teacher_data)[names(teacher_data)=="TT2G46_avg"] <- "sat_avg"
# Column "TT2G15" contains no useful data, since it's all NAs, so we're going to remove it
# Column "ID" is no longer necessary, was used for making the subject dataset
teacher_data <- teacher_data[,-which(names(teacher_data) %in% c("TT2G15","ID"))]
# Change some of the subject names (one of the comments in our dataset submission was that they weren't clear):
names(teacher_data)[names(teacher_data) == "For_Lang"] <- "Modern_Lang"
names(teacher_data)[names(teacher_data) == "Gk_Ln"] <- "Ancient_Lang"
# We also thought it might be useful to have a variable that indicates the number of subjects a given teacher teaches, so let's add it below:
subject_names <- c("R_WR_Lit", "Math", "Science", "Soc_Studies", "Modern_Lang", "Ancient_Lang", "Tech", "Arts", "PE", "Rel_Ethics", "Practical", "Other")
subject_indices <- c(which(colnames(teacher_data) %in% subject_names))
teacher_data$num_subjects <- rowSums(teacher_data[,subject_indices], na.rm=TRUE)
# We discussed in our dataset submission why we weren't going to use their teacher satisfaction variable, and we are instead going to use the average of the 10 satisfaction questions, as a result, let's remove the teacher satisfaction variable:
teacher_data <- teacher_data[, -which(names(teacher_data) %in% c("satisfaction"))]
# A lot of the variables stored as categorical variables with the categories "Strongly Disagree", "Disagree", "Strongly Agree", and "Agree", are currently factors. Given that these categories do have a natural ordering, we're going to convert them into ordered factors.
teacher_data$prep_A <- factor(teacher_data$prep_A, levels=c("Not at all", "Somewhat", "Well", "Very well"), ordered=TRUE)
teacher_data$prep_B <- factor(teacher_data$prep_B, levels=c("Not at all", "Somewhat", "Well", "Very well"), ordered=TRUE)
teacher_data$prep_C <- factor(teacher_data$prep_C, levels=c("Not at all", "Somewhat", "Well", "Very well"), ordered=TRUE)
teacher_data$feedback_salary <- factor(teacher_data$feedback_salary, levels=c("No positive change", "A small change", "A moderate change", "A large change"), ordered=TRUE)
teacher_data$collab <- factor(teacher_data$collab, levels=c("Strongly disagree", "Disagree", "Agree", "Strongly agree"), ordered=TRUE)
# We'll split the country data into regions using the World Health Organization categories, so that we can try to find regional patterns
# Brazil, Chile
teacher_data$region[teacher_data$country %in% c("BRA", "CHL")] <- "South America"
# Canada, Mexico, United States
teacher_data$region[teacher_data$country %in% c("CAB", "MEX", "USA")] <- "North America"
# Bulgaria, Czech Republic, Poland, Romania, Russia, Slovak Republic
teacher_data$region[teacher_data$country %in%
c("BGR", "CZE", "POL", "ROU", "RUS", "SVK")] <- "Eastern Europe"
# Denmark, England, Estonia, Finland, Latvia, Norway, Sweden
teacher_data$region[teacher_data$country %in%
c("DNK", "ENG", "EST", "FIN", "LVA", "NOR", "SWE")] <- "Northern Europe"
# Spain, Croatia, Italy, Portugal, Serbia (Latvia is already assigned to Northern Europe above)
teacher_data$region[teacher_data$country %in%
c("ESP", "HRV", "ITA", "PRT", "SRB")] <- "Southern Europe"
# Belgium, France, Netherlands
teacher_data$region[teacher_data$country %in%
c("BFL", "FRA","NLD")] <- "Western Europe"
# Abu Dhabi, Georgia, Israel
teacher_data$region[teacher_data$country %in% c("AAD", "GEO", "ISR")] <- "East Mediterranean"
# Australia, China, Japan, South Korea, Malaysia, New Zealand, Singapore
teacher_data$region[teacher_data$country %in% c("AUS", "CSH", "JPN", "KOR", "MYS", "NZL", "SGP")] <- "Western Pacific"
teacher_data$region <- as.factor(teacher_data$region)
# Histogram of teacher satisfaction
p1 <- teacher_data %>% ggplot(aes(x=sat_avg)) + geom_histogram(binwidth = 0.2) + labs(title="Plot 1: Histogram of Teacher Satisfaction", x="Teacher Satisfaction (low to high)", y="Count")
# Boxplot of whether they consider their school a collaborative environment to satisfaction
p2 <- teacher_data %>% filter(!is.na(collab)) %>% ggplot(aes(x=collab, y=sat_avg)) + geom_boxplot() +
labs(title="Plot 2: Relationship Between Collaboration Level
of the School and Teacher Satisfaction",
x="Response to 'There is a collaborative school culture
which is characterised by mutual support'",
y="Teacher Satisfaction")
# Boxplot of region to satisfaction
p3 <- teacher_data %>% filter(!is.na(sat_avg)) %>% ggplot(aes(x=reorder(region, -sat_avg, median), y=sat_avg)) + geom_boxplot() + labs(title="Plot 3: Teacher Satisfaction by Region", x="Region", y="Teacher Satisfaction")
grid.arrange(p1, p2, ncol=2)
p3
teacher_data %>% ggplot(aes(x=total_time)) + geom_histogram() + labs(title="Histogram of Time Spent on Teaching Activities per Week", x="Time Spent on Teaching Activities (hrs/week)", y="Count")
length(teacher_data[teacher_data$total_time>=90 & !is.na(teacher_data$total_time),]$total_time)
length(teacher_data[teacher_data$total_time>90 & !is.na(teacher_data$total_time),]$total_time)
teacher_data[teacher_data$total_time > 90 & !is.na(teacher_data$total_time), "total_time"] <- NA
teacher_data %>% ggplot(aes(x=age, y=years_teacher)) + geom_point(alpha=0.5,shape=".") + geom_abline(intercept =0, slope=1, color="red") + geom_abline(intercept = -15, slope=1, color="yellow")+ geom_jitter(shape=".") + labs(title="Reported Age vs Years Teaching", x="Age (years)", y="Years Teaching (years)")
age_start <- teacher_data$age - teacher_data$years_teacher
nrow(teacher_data[age_start < 15 & !is.na(age_start),])
teacher_data[age_start < 15 & !is.na(age_start), "years_teacher"] <- NA
#ggpairs(teacher_data)
# In order to have easier control over which predictors to use in our model, below we create several lists of indices that we might want to include or exclude as a group
# We don't want to use the 10 satisfaction questions to predict sat_avg, since they were used to create the sat_avg predictor, so let's create a list of the indices we need to avoid
satcols <- grep("sat", colnames(teacher_data))
sat_indices <- satcols[satcols != which(colnames(teacher_data) == "sat_avg")]
subject_names <- c("R_WR_Lit", "Math", "Science", "Soc_Studies", "Modern_Lang", "Ancient_Lang", "Tech", "Arts", "PE", "Rel_Ethics", "Practical", "Other")
subject_indices <- c(which(colnames(teacher_data) %in% subject_names))
# We might not want to include the three questions about how prepared they felt:
prep_indices <- grep("prep_", names(teacher_data))
# Some columns also have many more NA values, lets make a list of the columns that have NA values in more than 10% of the data (this ends up being num_students and feedback_salary), that way we can choose whether or not to include them in our linear models
na_cols <- c()
for(i in 1:ncol(teacher_data)) {
if(sum(is.na(teacher_data[,i]))/nrow(teacher_data) > 0.10) {
na_cols <- c(na_cols,i)
}
}
#teacher_data[, -c(sat_indices, na_cols, subject_indices,1, 37)] %>%
# gather(-sat_avg, -gender, key = "var", value = "value") %>%
# ggplot(aes(x = value, y = sat_avg, color = gender)) +
# geom_point(alpha=0.05) +
# facet_wrap(~ var, scales = "free") +
# theme_bw()
x <- na.omit(teacher_data[, -c(sat_indices, na_cols, subject_indices)])
# Try model with country variable and not region
m_country <- lm(sat_avg ~ ., data=x[,-which(names(x) %in% c("region"))])
# Try model with region variable and not country
m_region <- lm(sat_avg ~ ., data=x[,-which(names(x) %in% c("country"))])
cor(teacher_data$age, teacher_data$years_teacher, use = "complete.obs")
teacher_data$age_start <- age_start
teacher_data$age_ratio <- teacher_data$years_teacher/teacher_data$age
plot1 <- teacher_data %>% ggplot(aes(x=age, y=sat_avg)) + geom_point(alpha=0.05)
plot2 <- teacher_data %>% ggplot(aes(x=years_teacher, y=sat_avg)) + geom_point(alpha=0.05)
plot3 <- teacher_data %>% ggplot(aes(x=age_start, y=sat_avg)) + geom_point(alpha=0.05)
plot4 <- teacher_data %>% ggplot(aes(x=age_ratio, y=sat_avg)) + geom_point(alpha=0.05)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
data <- data.frame(prep_A = as.numeric(teacher_data$prep_A),
prep_B = as.numeric(teacher_data$prep_B),
prep_C = as.numeric(teacher_data$prep_C))
cor(data, use = "complete.obs")
teacher_data$avg_prep <- rowMeans(data)
x.new <- teacher_data[, -c(sat_indices, na_cols, subject_indices, prep_indices)]
x.new <- na.omit(x.new[, -which(names(x.new) %in% c("train_stat", "region", "age", "age_start", "age_ratio"))])
x.new %>% ggplot(aes(x=sat_avg)) + geom_histogram(binwidth = 0.2) + labs(title="Histogram of Teacher Satisfaction", x="Teacher Satisfaction (low to high)", y="Count")
p21 <- x.new %>% ggplot(aes(x=total_time)) + geom_histogram() + labs(title="Histogram of Time Spent on Teaching Activities", x="Time Spent on Teaching Related Activities", y="Count")
p22 <- x.new %>% ggplot(aes(x=years_teacher)) + geom_histogram() + labs(title="Histogram of Years Spent as a Teacher", x="Years of Teaching Experience", y="Count")
grid.arrange(p21, p22, ncol=2)
m_final <- lm(sat_avg ~ ., data=x.new)
plot(m_final)
confint(m_final)
# set to ISO codes
teacher_data$country <- revalue(teacher_data$country, c("AAD"="ARE", "BFL"="BEL","CAB"="CAN","CSH"="CHN","ENG"="GBR"))
# plyr masks dplyr's group_by/summarize, so detach it before computing the group means
detach(package:plyr)
country_avg <- teacher_data %>% group_by(country) %>% summarize(mean_satisfaction = mean(sat_avg, na.rm=TRUE))
library(plyr); library(dplyr)
#code for map with average teacher satisfaction by country
sPDF <- joinCountryData2Map(country_avg, joinCode = "ISO3", nameJoinColumn = "country")
mapCountryData(sPDF, nameColumnToPlot = "mean_satisfaction")
country_region_pairs <- unique(teacher_data[c("country", "region")])
#code for map with average teacher satisfaction by country
sPDF <- joinCountryData2Map(country_region_pairs, joinCode = "ISO3", nameJoinColumn = "country")
mapParams <-mapCountryData(sPDF,
nameColumnToPlot = "region",
colourPalette="rainbow",
addLegend = FALSE,
mapTitle = "WHO Regions"
)
do.call(addMapLegendBoxes, c(mapParams, x='bottomleft',title="Region", cex=1.2))
## p3, used as an example plot earlier in this write-up, is a plot of sat_avg vs region
p3
# collab plot
teacher_data %>% filter(!is.na(collab)) %>% ggplot(aes(x = product(collab, region), fill=factor(collab)), na.rm=TRUE) + geom_mosaic() + scale_fill_manual(values=c("red3","indianred1", "lightgreen", "green3")) + theme(axis.text.x=element_text(angle=-25, hjust= .1)) + labs(x="Region", title='How Collaborative is School Culture') + guides(fill=guide_legend(title = "Response", reverse = TRUE))
# prep plot
teacher_data[, c("prep_A", "prep_B", "prep_C","region")] %>%
gather(-region, key = "var", value = "value") %>%
mutate(value = factor(value, levels = c("Very well", "Well", "Somewhat", "Not at all"))) %>% # levels must match the factor labels exactly ("Very well", not "Very Well")
ggplot(aes(x = product(value,region), fill=factor(value))) + geom_mosaic() +
theme(axis.text.x=element_text(angle=-25, hjust= .1)) + labs(x="Region", title='How Prepared the Teacher felt') + guides(fill=guide_legend(title = "Response", reverse=TRUE)) +
facet_wrap(~ var, scales = "free")
# gender plot
teacher_data %>% filter(!is.na(gender)) %>% ggplot(aes(x = product(gender, region), fill=factor(gender)), na.rm=TRUE) + geom_mosaic() + theme(axis.text.x=element_text(angle=-25, hjust= .1)) + labs(x="Region", title='Gender') + guides(fill=guide_legend(title = "Response", reverse = TRUE))
# age plot
teacher_data %>% filter(!is.na(age)) %>% ggplot(aes(x=reorder(region, -as.numeric(age), median), y=as.numeric(age))) + geom_boxplot() + xlab("Region") + ylab("Age")
# total_time plot
teacher_data %>% filter(!is.na(total_time)) %>% ggplot(aes(x=reorder(region, -as.numeric(total_time), median), y=as.numeric(total_time))) + geom_boxplot() + xlab("Region") + ylab("Total Time Working")
# age_start plot
teacher_data %>% filter(!is.na(age_start)) %>% ggplot(aes(x=reorder(region, -as.numeric(age_start), median), y=as.numeric(age_start))) + geom_boxplot() + xlab("Region") + ylab("Years not spent teaching")
# Check which variables have a significant correlation with region
anova(lm(as.numeric(collab) ~ region, data=teacher_data))
anova(lm(as.numeric(prep_A) ~ region, data=teacher_data))
anova(lm(as.numeric(prep_B) ~ region, data=teacher_data))
anova(lm(as.numeric(prep_C) ~ region, data=teacher_data))
anova(lm(as.numeric(age) ~ region, data=teacher_data))
anova(lm(as.numeric(total_time) ~ region, data=teacher_data))
anova(lm(as.numeric(years_teacher) ~ region, data=teacher_data))
# Collab and region LM
lm(sat_avg ~ region*as.numeric(collab), data=teacher_data)
# Collab and region plot
teacher_data %>% filter(!is.na(as.numeric(collab))) %>% ggplot(aes(x=as.numeric(collab), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("Collab") + ylab("Sat") + facet_wrap(~region)
# prep_A and region LM
lm(sat_avg ~ region*as.numeric(prep_A), data=teacher_data)
# prep_A and region plot
teacher_data %>% filter(!is.na(as.numeric(prep_A))) %>% ggplot(aes(x=as.numeric(prep_A), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("prep_A") + ylab("Sat") + facet_wrap(~region)
# prep_B and region LM
lm(sat_avg ~ region*as.numeric(prep_B), data=teacher_data)
# prep_B and region plot
teacher_data %>% filter(!is.na(as.numeric(prep_B))) %>% ggplot(aes(x=as.numeric(prep_B), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("prep_B") + ylab("Sat") + facet_wrap(~region)
# prep_C and region LM
lm(sat_avg ~ region*as.numeric(prep_C), data=teacher_data)
# prep_C and region plot
teacher_data %>% filter(!is.na(as.numeric(prep_C))) %>% ggplot(aes(x=as.numeric(prep_C), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("prep_C") + ylab("Sat") + facet_wrap(~region)
# age and region LM
lm(sat_avg ~ region*as.numeric(age), data=teacher_data)
# age and region plot
teacher_data %>% filter(!is.na(as.numeric(age))) %>% ggplot(aes(x=as.numeric(age), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("Age") + ylab("Sat") + facet_wrap(~region)
# total_time and region LM
lm(sat_avg ~ region*as.numeric(total_time), data=teacher_data)
# total_time and region plot
teacher_data %>% filter(!is.na(as.numeric(total_time))) %>% ggplot(aes(x=as.numeric(total_time), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("Total Time Working") + ylab("Sat") + facet_wrap(~region)
# years_teacher and region LM
lm(sat_avg ~ region*as.numeric(years_teacher), data=teacher_data)
# years_teacher and region plot
teacher_data %>% filter(!is.na(as.numeric(years_teacher))) %>% ggplot(aes(x=as.numeric(years_teacher), y=as.numeric(sat_avg))) + geom_point(alpha=0.05) + xlab("Years Working As Teacher") + ylab("Sat") + facet_wrap(~region)
# Regression of sat_avg vs region
msatreg <- lm(sat_avg ~ region, data=teacher_data)
# ANOVA of regression
anova(msatreg)
# Template code for an ANOVA of a predictor against region (swap collab for other significant predictors once found); collab is an ordered factor, so convert it to numeric first
mcollabreg <- lm(as.numeric(collab) ~ region, data=teacher_data)
anova(mcollabreg)
m_countryonly <- lm(I(sat_avg^2) ~ country, data=x.new)
summary(m_countryonly)$r.squared
nrow(teacher_data[teacher_data$sat_avg >=2.5,])/nrow(teacher_data)
# Lower-secondary-school teacher data (international file, all countries) is the file titled BTGINTT2
lower_second_data <- read.spss("SPSS_International/BTGINTT2.sav", to.data.frame=TRUE, use.missings=TRUE)
cols <- c("CNTRY", "TT2G01", "TT2G02", "TT2G05B", "TT2G11", "TT2G13A", "TT2G13B", "TT2G13C", "TT2G16", "TT2G30G", "TT2G38", "TT2G44E", "TJOBSATS")
# We are creating a dataset with the columns from our project proposal, the 10 questions related to job satisfaction, which are columns TT2G46A through TT2G46J, as well as a few additional columns which we also think might be useful
data <- cbind(lower_second_data[,cols],lower_second_data[,grep("TT2G46", colnames(lower_second_data))], lower_second_data[,grep("TT2G15", colnames(lower_second_data))])
# Rename some of the columns so that they're easier to work with
data <- data %>% rename(gender=TT2G01, age=TT2G02, country=CNTRY, years_teacher = TT2G05B, train_stat = TT2G11, prep_A=TT2G13A, prep_B=TT2G13B, prep_C=TT2G13C, total_time=TT2G16, feedback_salary=TT2G30G, collab=TT2G44E, num_students=TT2G38, satisfaction=TJOBSATS)
# Columns that start with TT2G15 correspond to several questions asking what subjects teachers teach:
#Let's manipulate the subject data so that it's in a form that's useful to us
# First make an ID column to merge back after isolating the subject as one column
data$ID <- seq.int(nrow(data))
# Indices of columns dealing with subject taught
indices <- grep("TT2G15", colnames(data))[-1] #-1 to deal with the TT2G15 column that isn't for a specific subject
# Convert "Yes" "No" levels to 0 for didn't teach subject and 1 for taught subject
for (i in indices) {
data[,i] <- 2 - as.numeric(data[,i])
}
# Rename columns by subject
data <- data %>% rename(R_WR_Lit=TT2G15A, Math=TT2G15B, Science=TT2G15C, Soc_Studies = TT2G15D, For_Lang = TT2G15E, Gk_Ln=TT2G15F, Tech=TT2G15G, Arts=TT2G15H, PE=TT2G15I, Rel_Ethics=TT2G15J, Practical=TT2G15K, Other=TT2G15L)
# Names of columns
col_names <- colnames(data)[indices]
# Below is just the R code for extracting the subject column
# This is separated for clarity since it is extensive in and of itself
df <- data[25:37]
# Melt to format -- similar to "gather"
temp_by_subj <- melt(df,id="ID")
# Only keep 1's (subjects taught)
temp_by_subj <- temp_by_subj[which(temp_by_subj$value==1),]
# Merge the subject table in with data
data_by_subject <- merge(data,temp_by_subj,all.y=T, by = "ID")
# Clean up unnecessary columns
data_by_subject <- data_by_subject[,-(c(15:24,26:37,39))]
data_by_subject <- data_by_subject %>% rename(Subject=variable)
# Below is the code used to clean up the satisfaction related questions, as well as a few other columns
# All ten columns with responses related to job satisfaction are factors, with 4 levels: "Strongly Disagree", "Disagree", "Agree", "Strongly Agree".
# We are going to convert these 10 columns to numerical data, where the factors get converted to the integers 1, 2, 3, and 4, with 1 corresponding to Strongly Disagree and 4 corresponding to Strongly Agree.
for (i in grep("TT2G46", colnames(data))) {
data[,i] <- as.numeric(data[,i])
}
# Questions C, D, and F were negatively phrased questions, so we are reversing their scales so we can compare them to the positively phrased questions.
data$TT2G46C <- 5 - data$TT2G46C
data$TT2G46D <- 5 - data$TT2G46D
data$TT2G46F <- 5 - data$TT2G46F
# Questions TT2G46A through J were all related to teacher satisfaction. Consequently, we are going to create a column that averages the results of these 10 questions to get a sense of teacher satisfaction
data$TT2G46_avg <- rowMeans(data[,grep("TT2G46", colnames(data))], na.rm=TRUE)
# Clean up country column, which has a lot of whitespace
data$country <- as.factor(trimws(as.character(data$country)))
# Clean up train_stat variable so it's easier to work with
# Make No the first factor, so that if we convert it to numeric, it's easy to have 0 represent No and 1 represent Yes
data$train_stat <- relevel(data$train_stat, "No")